This is the template for the image recognition exercise.
Some general instructions, read these carefully:
Deadlines:
No extensions for the deadlines
If you encounter problems, Google first and if you can’t find an answer, ask for help
Grading
The exercise counts toward part of the course grade. The course exam has 5 questions, each worth 6 points. The exercise gives 6 points, so the total score is 36 points.
From the template below, you can see how many exercise points can be acquired from each task. Exam points are given according to the table below:
Each student will grade one submission from a peer and their own submission. After each Part deadline, example results are given. Study them carefully and perform the grading according to the given instructions. The mean of the peer grade and the self-grade is used for the final points.
import uuid
# Run this cell only once and save the code. Use the same id code for each Part.
# Printing random id using uuid1()
print ("The id code is: ",end="")
##print (uuid.uuid1())
print ('73e94c3f-a547-11ed-a26a-38fc986e2b5b')
The id code is: 73e94c3f-a547-11ed-a26a-38fc986e2b5b
Read the original research article:
İ. Çınar and M. Koklu. Identification of rice varieties using machine learning algorithms. Journal of Agricultural Sciences, 28(2):307–325, 2022. doi: 10.15832/ankutbd.862482.
https://dergipark.org.tr/en/download/article-file/1513632
Will be written in Part 3
Make three folders in your working folder: "notebooks", "data" and "training_data". Save this notebook in "notebooks" folder.
Perform preparations for the data
Gather the feature data
Dimension features (6)
Gather all the features in one array or dataframe: one data point in one row, including all feature values in columns.
For each data point, include also information of the original image and the label (rice species). Save the data in "training_data" folder.
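As a sketch of what the gathered table might look like, assuming a hypothetical `extract_features` helper that stands in for the actual image-processing steps (the feature values and paths here are placeholders, not real measurements):

```python
import pandas as pd

# Hypothetical feature extractor: in the real notebook these numbers come from
# the image-processing steps (color statistics, contour measurements, etc.).
def extract_features(image_path, label):
    return {"rice_type": label, "path": image_path,
            "roundness": 0.71, "aspect_ratio": 1.94}  # ...plus the other features

# One data point per row: label, source image path, and all feature values
rows = [extract_features("../data/Arborio/Arborio (1).jpg", "arborio"),
        extract_features("../data/Basmati/Basmati (1).jpg", "basmati")]
training_data = pd.DataFrame(rows)

# In the notebook this would be saved under the "training_data" folder,
# e.g. as ../training_data/training_data.csv
training_data.to_csv("training_data.csv")
print(training_data.shape)
```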
Select the best model for each classifier. Use 5-fold repeated cross validation with 3 repetitions (RepeatedKFold from sklearn). You can choose the hyperparameter ranges to use (i.e. from which values the best hyperparameters are selected if they are not stated below.)
For each classifier:
For random forest model, report the feature importance for each feature. Which features seem to be the most important? Does this correspond with the observations you made in the data exploration?
Ponder the model selection process. What things should be considered when selecting the model to be used?
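The cross-validation setup described above can be sketched as follows, using synthetic stand-in data in place of the gathered feature table: 5 folds repeated 3 times yields 15 accuracy estimates per hyperparameter setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in the exercise, X and y come from the gathered feature table
X, y = make_classification(n_samples=150, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

# 5-fold cross-validation repeated 3 times -> 15 scores per model
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y,
                         scoring='accuracy', cv=cv, n_jobs=-1)
print(len(scores), round(scores.mean(), 3))
```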
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt
import math
import seaborn as sns
import sklearn.decomposition as SD
from tqdm import tqdm
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
# read data
training_data = pd.read_csv('../training_data/training_data.csv', index_col = [0])
# display first five rows
training_data.head()
| rice_type | path | r_mean | g_mean | b_mean | r_variance | g_variance | b_variance | r_skewness | g_skewness | ... | b_kurtosis | r_entropy | g_entropy | b_entropy | major_axis_length (px) | minor_axis_lenth (px) | contour_area (px) | contour_perimeter (px) | roundness | aspect_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arborio | ../data/Arborio\Arborio (9204).jpg | 222.522289 | 222.462998 | 224.158894 | 414.157721 | 420.233364 | 331.755208 | -1.451926 | -1.451582 | ... | 3.853837 | 6.031748 | 6.041130 | 5.840202 | 129.655731 | 66.700142 | 6648.0 | 343.220343 | 0.709177 | 1.943860 |
| 1 | arborio | ../data/Arborio\Arborio (95).jpg | 225.105370 | 225.438161 | 231.659615 | 441.078683 | 460.979968 | 397.579497 | -1.039958 | -0.999778 | ... | 2.295804 | 6.098831 | 6.128041 | 5.903111 | 135.751190 | 69.416695 | 7221.0 | 357.019333 | 0.711908 | 1.955599 |
| 2 | arborio | ../data/Arborio\Arborio (6523).jpg | 207.078867 | 208.550371 | 218.762061 | 317.646307 | 338.739471 | 344.251594 | -0.909045 | -0.852550 | ... | 1.606971 | 5.986432 | 6.035016 | 6.021399 | 142.041885 | 73.650620 | 8065.5 | 378.433546 | 0.707721 | 1.928591 |
| 3 | arborio | ../data/Arborio\Arborio (7523).jpg | 196.267586 | 195.825107 | 205.800233 | 207.988053 | 209.938308 | 231.463681 | -1.117349 | -1.113135 | ... | 2.283707 | 5.726940 | 5.732542 | 5.769262 | 112.533409 | 57.882843 | 5013.5 | 295.722870 | 0.720412 | 1.944158 |
| 4 | arborio | ../data/Arborio\Arborio (11106).jpg | 200.921386 | 200.854356 | 203.404185 | 296.996302 | 299.498530 | 290.685397 | -0.693292 | -0.694011 | ... | 0.476972 | 5.963724 | 5.970757 | 5.946620 | 144.727615 | 74.547760 | 8303.0 | 377.102593 | 0.733712 | 1.941408 |
5 rows × 23 columns
# a dataset with only the features
features = training_data.copy().drop(['rice_type', 'path'], axis=1)
#separate feature column names
cols = features.columns
# z-score scaling
scaler = StandardScaler()
scale = scaler.fit_transform(features)
#back to dataframe
features = pd.DataFrame(scale)
features.columns = cols
#display first five rows
display(features.head())
# scaled data and labels in one dataframe
training_data_scaled = pd.DataFrame(training_data['rice_type'])
training_data_scaled = training_data_scaled.join(features)
#display first five rows
display(training_data_scaled.head())
# separate labels into their own array
labels = training_data['rice_type']
#display first five rows
display(labels.head())
| r_mean | g_mean | b_mean | r_variance | g_variance | b_variance | r_skewness | g_skewness | b_skewness | r_kurtosis | ... | b_kurtosis | r_entropy | g_entropy | b_entropy | major_axis_length (px) | minor_axis_lenth (px) | contour_area (px) | contour_perimeter (px) | roundness | aspect_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.903478 | 0.468699 | -0.376299 | 1.609257 | 1.590499 | 1.133324 | -0.398200 | -0.129804 | 0.432525 | -0.558665 | ... | -0.788546 | 1.011422 | 0.943312 | 0.930546 | -1.155969 | 0.976376 | -0.261640 | -0.911635 | 1.189391 | -1.208987 |
| 1 | 1.193838 | 0.767847 | 0.272219 | 1.901135 | 2.028030 | 1.961155 | 0.185925 | 0.391209 | 0.884403 | -0.844676 | ... | -1.019225 | 1.205410 | 1.158406 | 1.051936 | -0.982381 | 1.214996 | 0.150479 | -0.665492 | 1.213981 | -1.198084 |
| 2 | -0.832487 | -0.930193 | -0.842913 | 0.562879 | 0.715431 | 1.290483 | 0.371545 | 0.560991 | 1.126394 | -0.883037 | ... | -1.121212 | 0.880376 | 0.928180 | 1.280185 | -0.803233 | 1.586901 | 0.757869 | -0.283511 | 1.176273 | -1.223169 |
| 3 | -2.047763 | -2.209699 | -1.963602 | -0.626038 | -0.667616 | -0.127980 | 0.076194 | 0.260488 | 0.937813 | -0.600138 | ... | -1.021016 | 0.129980 | 0.179593 | 0.793663 | -1.643583 | 0.201871 | -1.437223 | -1.758882 | 1.290571 | -1.208710 |
| 4 | -1.524638 | -1.704015 | -2.170766 | 0.338991 | 0.294068 | 0.616815 | 0.677459 | 0.743815 | 1.406361 | -1.095999 | ... | -1.288518 | 0.814711 | 0.769146 | 1.135892 | -0.726748 | 1.665706 | 0.928686 | -0.307252 | 1.410350 | -1.211264 |
5 rows × 21 columns
| rice_type | r_mean | g_mean | b_mean | r_variance | g_variance | b_variance | r_skewness | g_skewness | b_skewness | ... | b_kurtosis | r_entropy | g_entropy | b_entropy | major_axis_length (px) | minor_axis_lenth (px) | contour_area (px) | contour_perimeter (px) | roundness | aspect_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arborio | 0.903478 | 0.468699 | -0.376299 | 1.609257 | 1.590499 | 1.133324 | -0.398200 | -0.129804 | 0.432525 | ... | -0.788546 | 1.011422 | 0.943312 | 0.930546 | -1.155969 | 0.976376 | -0.261640 | -0.911635 | 1.189391 | -1.208987 |
| 1 | arborio | 1.193838 | 0.767847 | 0.272219 | 1.901135 | 2.028030 | 1.961155 | 0.185925 | 0.391209 | 0.884403 | ... | -1.019225 | 1.205410 | 1.158406 | 1.051936 | -0.982381 | 1.214996 | 0.150479 | -0.665492 | 1.213981 | -1.198084 |
| 2 | arborio | -0.832487 | -0.930193 | -0.842913 | 0.562879 | 0.715431 | 1.290483 | 0.371545 | 0.560991 | 1.126394 | ... | -1.121212 | 0.880376 | 0.928180 | 1.280185 | -0.803233 | 1.586901 | 0.757869 | -0.283511 | 1.176273 | -1.223169 |
| 3 | arborio | -2.047763 | -2.209699 | -1.963602 | -0.626038 | -0.667616 | -0.127980 | 0.076194 | 0.260488 | 0.937813 | ... | -1.021016 | 0.129980 | 0.179593 | 0.793663 | -1.643583 | 0.201871 | -1.437223 | -1.758882 | 1.290571 | -1.208710 |
| 4 | arborio | -1.524638 | -1.704015 | -2.170766 | 0.338991 | 0.294068 | 0.616815 | 0.677459 | 0.743815 | 1.406361 | ... | -1.288518 | 0.814711 | 0.769146 | 1.135892 | -0.726748 | 1.665706 | 0.928686 | -0.307252 | 1.410350 | -1.211264 |
5 rows × 22 columns
0    arborio
1    arborio
2    arborio
3    arborio
4    arborio
Name: rice_type, dtype: object
# initialize figure
fig = plt.figure(figsize=(20, 50))
# setting values to rows and column variables
rows = math.ceil(len(cols)/3)
columns = 3
# initialize x (place of sub figure)
x = 0
# box plot all features
# loop feature columns
for col in features.columns:
    x = x + 1
    fig.add_subplot(rows, columns, x)
    plt.boxplot(features[col])
    plt.title(col)
plt.show(fig)
# separate different rice types into their own datasets (needed for drawing the histograms)
arborio = training_data_scaled.loc[training_data_scaled['rice_type'] == 'arborio']
basmati = training_data_scaled.loc[training_data_scaled['rice_type'] == 'basmati']
jasmine = training_data_scaled.loc[training_data_scaled['rice_type'] == 'jasmine']
# initialize figure
fig = plt.figure(figsize=(20, 50))
# setting values to rows and column variables
rows = math.ceil(len(cols)/3)
columns = 3
x = 0
# plot histograms
# loop through feature columns
for col in features.columns:
    x = x + 1
    fig.add_subplot(rows, columns, x)
    # plot each rice type in a different color
    plt.hist([arborio[col], basmati[col], jasmine[col]], bins=15, label=['arborio', 'basmati', 'jasmine'])
    plt.legend(loc='upper right')
    plt.title(col)
plt.show(fig)
I tried seaborn's pairplot, but the figures were too small and running it took a long time, so I made scatter plots of all feature pairs instead.
fig = plt.figure(figsize=(30, 300))
# setting values to rows and column variables
row_count = 47
column_count = 5
x = 0
# loop through the feature columns
columns = np.array(features.columns)
for col in tqdm(features.columns):
    # drop the first element from the list so the same pair isn't plotted twice
    columns = np.delete(columns, 0)
    # loop through the columns that are left in the list
    for c in columns:
        x = x + 1
        fig.add_subplot(row_count, column_count, x)
        # manual "pairplot": plain scatter plots give bigger figures and show the axis names in the titles
        plt.scatter(features[col], features[c])
        plt.title("x: " + col + " y: " + c)
plt.show(fig)
100%|██████████████████████████████████████████████████████████████████████████████████| 21/21 [00:09<00:00, 2.22it/s]
# initialize figure
fig = plt.figure(figsize=(30, 300))
# setting values to rows and column variables
row_count = 47
column_count = 5
x = 0
columns = np.array(features.columns)
# loop through the feature columns
for c in tqdm(columns):
    x = x + 1
    fig.add_subplot(row_count, column_count, x)
    # scatter plot of each feature against the rice type
    plt.scatter(training_data_scaled['rice_type'], features[c])
    plt.title(c)
plt.show(fig)
100%|██████████████████████████████████████████████████████████████████████████████████| 21/21 [00:02<00:00, 8.56it/s]
From the histograms, aspect ratio and roundness separate the classes best. The scatter plots seem to support this: the plots that include aspect ratio or roundness show the most distinct clusters (though without coloring in the rice types we can't be sure they are the correct clusters). The color features don't seem to contribute much to the clustering.
# create PCA model
pca = SD.PCA().fit(features)
# transform data into new space
df_pca = pd.DataFrame(pca.transform(features))
# add transformed data back to original data frame
training_scaled_pca = pd.concat([training_data_scaled['rice_type'], df_pca], axis=1)
# figure texts
fig = plt.figure(figsize=(15, 15))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Rice_types",fontsize=20)
# assign each rice type its own color and plot them separately
targets = ['arborio', 'basmati', 'jasmine']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = training_scaled_pca['rice_type'] == target
    plt.scatter(training_scaled_pca.loc[indicesToKeep, 0],
                training_scaled_pca.loc[indicesToKeep, 1], c=color, s=50)
plt.legend(targets,prop={'size': 15})
plt.show(fig)
We can see three distinct clusters formed, so the rice types can probably be separated well in this space.
# create PCA model
pca_99 = SD.PCA(n_components = 0.99).fit(features)
# transform data into new space
df_pca_99 = pd.DataFrame(pca_99.transform(features))
# How many components are in the formed dataframe
print(len(df_pca_99.columns))
10
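The same component count can be read off the cumulative explained-variance ratio directly. A small sketch on synthetic stand-in data (21 features, like the scaled table above); the 99% rule here matches, up to ties, what `PCA(n_components=0.99)` does internally:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Stand-in data with 21 features, like the scaled feature table above
X, _ = make_classification(n_samples=200, n_features=21, n_informative=10,
                           random_state=42)
pca = PCA().fit(X)

# Cumulative share of variance explained by the first n components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest n whose cumulative share reaches 99%
n_99 = int(np.argmax(cumulative >= 0.99)) + 1
print(n_99)
```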
Select the best model for each classifier. Use 5-fold repeated cross validation with 3 repetitions (RepeatedKFold from sklearn). You can choose the hyperparameter ranges to use.
# initialize cross-validation here
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
#https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/
# initialize values
k_range = 201
k_accuracy = []
# loop through the k values to find the best one
for k in range(1, k_range):
    # initialize model
    knn_model = KNeighborsClassifier(n_neighbors=k)
    # evaluate model performance with repeated cross-validation
    knn_scores = cross_val_score(knn_model, features, labels, scoring='accuracy', cv=cv, n_jobs=-1)
    k_accuracy.append(knn_scores.mean())
print('done')
done
# find the k that gives us the maximum value
maximum_value = max(k_accuracy)
print('Maximum accuracy is ', maximum_value)
print('This value is achieved with k-value of ', k_accuracy.index(maximum_value)+1)
Maximum accuracy is  0.9899999999999999
This value is achieved with k-value of  1
# initialize array
k_value = np.arange(1, k_range)
#find the index position of the best value
ymax = max(k_accuracy)
pos = k_accuracy.index(ymax)
#plot the figure
plt.title("Accuracy score for k neighbours")
plt.xlabel("k neighbours")
plt.ylabel("mean accuracy score")
plt.plot(k_value,k_accuracy)
# plot the point that gives us the best value
plt.plot(k_value[pos], k_accuracy[pos], 'r*')
plt.show()
hyperparameters:
For random forest model, report the feature importance for each feature. Which features seem to be the most important? Does this correspond with the observations you made in the data exploration?
# build an array of [max_features, max_depth] parameter pairs
parameters = []
for mf in range(1, 22):
    for md in range(1, 11):
        parameters.append([mf, md])
print('done')
done
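As a side note, the nested loops above can be written more compactly with `itertools.product`; a minimal sketch producing the same 210 `[max_features, max_depth]` pairs in the same order:

```python
from itertools import product

# Same grid as the nested loops above: max_features 1..21 x max_depth 1..10
parameters = [list(p) for p in product(range(1, 22), range(1, 11))]
print(len(parameters))  # 210 combinations
```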
rf_accuracy = []
# loop through the parameter combinations
for p in tqdm(parameters):
    # initialize random forest model
    rf_model = RandomForestClassifier(n_estimators=100, max_depth=p[1], max_features=p[0], random_state=42)
    # evaluate model performance with repeated cross-validation
    rf_scores = cross_val_score(rf_model, features, labels, scoring='accuracy', cv=cv, n_jobs=-1)
    rf_accuracy.append(rf_scores.mean())
print('done')
100%|████████████████████████████████████████████████████████████████████████████████| 210/210 [04:33<00:00, 1.30s/it]
done
# find maximum value and the hyper parameter we achieve it with
maximum_value = max(rf_accuracy)
print('Maximum accuracy is ', maximum_value)
max_params = parameters[rf_accuracy.index(maximum_value)]
print('This value is achieved with max_features of ', max_params[0] , 'and max_depth of ', max_params[1])
Maximum accuracy is  0.991111111111111
This value is achieved with max_features of  3 and max_depth of  7
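As a side note, scikit-learn's `GridSearchCV` automates this kind of parameter sweep, including picking the best combination. A minimal sketch on synthetic stand-in data, deliberately with a tiny grid so it runs quickly (the exercise sweeps the full 1..21 × 1..10 ranges):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Stand-in data; in the exercise this would be the scaled features and labels
X, y = make_classification(n_samples=120, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
# GridSearchCV runs the repeated CV for every combination in param_grid
grid = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                    param_grid={'max_features': [2, 3], 'max_depth': [3, 5]},
                    scoring='accuracy', cv=cv, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)
```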
# initialize arrays for plot
parameter_count = len(parameters)
parameter_inds = np.arange(0, parameter_count)
#find position of max value
ymax = max(rf_accuracy)
pos = rf_accuracy.index(ymax)
#plot figure
plt.title("Accuracy score for random forest")
plt.xlabel("index of parameter combination array")
plt.ylabel("mean accuracy score")
plt.plot(parameter_inds, rf_accuracy)
# plot highest point
plt.plot(parameter_inds[pos], rf_accuracy[pos], 'r*')
plt.show()
# initialize model with best parameter combination
rf_model_max_params = RandomForestClassifier(n_estimators=100, max_depth=7, max_features=3, random_state=42)
#fit with data
rf_model_max_params.fit(features, labels)
# Find the importance of features
plt.barh(rf_model_max_params.feature_names_in_, rf_model_max_params.feature_importances_)
plt.show()
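For readability, the importances can also be sorted into a ranked list instead of reading them off the bar chart. A small sketch on synthetic stand-in data (the real notebook would use `rf_model_max_params` and the actual feature names instead of the placeholder names here):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the fitted rf_model_max_params above
X, y = make_classification(n_samples=150, n_features=6, n_informative=3,
                           random_state=42)
rf = RandomForestClassifier(n_estimators=100, max_depth=7, max_features=3,
                            random_state=42).fit(X, y)

# A sorted Series makes the ranking easier to read than the raw bar chart
importances = pd.Series(rf.feature_importances_,
                        index=[f"feature_{i}" for i in range(X.shape[1])])
print(importances.sort_values(ascending=False))
```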
Use one hidden layer and early stopping.
Hyperparameters:
# build an array of parameter combinations
activation = ['logistic', 'relu']
solver = ['adam', 'sgd']
validation_fraction = [0.1, 0.5]
parameters = []
for size in range(1, 22):
    for a in activation:
        for s in solver:
            for v in validation_fraction:
                parameters.append([size, a, s, v])
print('done')
done
mlp_accuracy = []
# loop through the parameter combinations to find the best one
for p in tqdm(parameters):
    # initialize model
    MLPmodel = MLPClassifier(
        random_state=42,
        early_stopping=True,
        hidden_layer_sizes=(p[0],),
        activation=p[1],
        solver=p[2],
        validation_fraction=p[3],
    )
    # evaluate model performance with repeated cross-validation
    mlp_scores = cross_val_score(MLPmodel, features, labels, scoring='accuracy', cv=cv, n_jobs=-1)
    mlp_accuracy.append(mlp_scores.mean())
print('done')
100%|████████████████████████████████████████████████████████████████████████████████| 168/168 [01:37<00:00, 1.73it/s]
done
# find max value and its parameter combination
maximum_value = max(mlp_accuracy)
print('Maximum accuracy is ', maximum_value)
max_params = parameters[mlp_accuracy.index(maximum_value)]
print(max_params)
Maximum accuracy is  0.9211111111111111
[20, 'logistic', 'adam', 0.5]
# initialize arrays for plotting
parameter_count = len(parameters)
parameter_inds = np.arange(0, parameter_count)
# find position of best value
ymax = max(mlp_accuracy)
pos = mlp_accuracy.index(ymax)
# plot accuracy curve
plt.title("Accuracy score for MLP")
plt.xlabel("index of parameter combination array")
plt.ylabel("mean accuracy score")
plt.plot(parameter_inds, mlp_accuracy)
# plot point of max value
plt.plot(parameter_inds[pos], mlp_accuracy[pos], 'r*')
plt.show()
Random forest gave us the best accuracy score, but running it took the longest time. KNN was faster and gave almost as good a result as random forest. MLP's accuracy wasn't as good as the other models'. In this case I would choose random forest or KNN. KNN was also the easiest to implement. Based on runtime and ease of implementation, I would choose KNN.